
scheduler: preserve prompt checkpoints in chunked prefill resume path#221

Open
krystophny wants to merge 2 commits into waybarrios:main from
computor-org:fix/chunked-prefill-prompt-checkpoints-upstream

Conversation

@krystophny
Contributor

@krystophny krystophny commented Mar 24, 2026

Summary

Preserve prompt_checkpoints across chunked-prefill partial resume and finalization for mlx-lm prompt tuples, and invoke prompt_checkpoint_callback as the upstream contract requires.

Why

Recent mlx-lm prompt tuples carry a seventh prompt_checkpoints field. Upstream one-line fixes cover the immediate unpack crash in _chunked_next, but the chunked-prefill monkeypatch also needs to retain that field when it stores partial progress and resumes generation later.

Additionally, the upstream BatchGenerator invokes prompt_checkpoint_callback after cache finalization. The chunked-prefill monkeypatch was missing this callback invocation, breaking the checkpoint contract.
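The ordering that contract implies can be sketched roughly as follows. This is a minimal illustration, not the real mlx-lm or vllm_mlx code: `FakeCache`, `finalize_with_checkpoint`, and the argument names are stand-ins.

```python
# Hedged sketch of the ordering this PR restores: the callback fires
# after cache finalization and before the checkpoint tail is fed back
# through the model. All names here are illustrative stand-ins.

class FakeCache:
    """Stand-in for an mlx-lm prompt cache."""
    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True


def finalize_with_checkpoint(cache, uid, checkpoint, tokens, callback, run_model):
    cache.finalize()                       # 1. seal the prefill cache
    if callback is not None and checkpoint > 0:
        callback(uid, checkpoint)          # 2. contract: callback after finalize
    return run_model(tokens[:checkpoint])  # 3. checkpoint tail model call
```

The point of the sketch is only the sequence: finalize, then callback, then the tail model call.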

What changed

  • update the chunked-prefill scheduler monkeypatch to accept 7-field prompt tuples
  • preserve prompt-checkpoint state through partial-prefill resume and finalization
  • import _lazy_extract_cache from mlx-lm and invoke prompt_checkpoint_callback at the correct semantic boundary (after c.finalize(), before the checkpoint tail model call)
  • add regression tests: one for partial chunked-prefill with prompt checkpoints, one verifying the callback fires with correct uid and checkpoint offset
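The tuple handling in the first two bullets can be illustrated like this. The field names are assumptions for the sketch; the actual tuple layout belongs to mlx-lm, and the real logic lives in vllm_mlx/scheduler.py.

```python
# Illustrative sketch of accepting 7-field prompt tuples and carrying
# prompt_checkpoints through a partial-prefill store/resume cycle.
# Field names are assumed, not the real mlx-lm layout.

def unpack_prompts(batch_prompts):
    # Newer mlx-lm prompt tuples carry a seventh prompt_checkpoints field;
    # a 6-way unpack here is the crash the earlier one-line fixes patched.
    uids, toks, caches, samplers, logits, max_toks, checkpoints = zip(*batch_prompts)
    return uids, toks, caches, samplers, logits, max_toks, checkpoints


def store_partial(state, uid, consumed, prompt_tuple):
    # Keep the full 7-tuple, checkpoints included, so resume sees it.
    state[uid] = (consumed, prompt_tuple)


def resume(state, uid):
    consumed, prompt_tuple = state.pop(uid)
    remaining = prompt_tuple[1][consumed:]  # tokens not yet prefilled
    checkpoints = prompt_tuple[6]           # preserved, not dropped
    return remaining, checkpoints
```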

Files to review

  • vllm_mlx/scheduler.py
  • tests/test_batching.py

Related PRs

  • broader than #194 and #156, which address only the immediate unpack failure in _chunked_next

Validation

$ python -m pytest tests/test_batching.py::TestSchedulerBasic -v
9 passed in 2.33s

@krystophny krystophny marked this pull request as draft March 24, 2026 19:08
@krystophny
Contributor Author

Critical review found a blocking scheduler issue: the chunked-prefill path still does not invoke prompt_checkpoint_callback, so it does not preserve the upstream checkpoint contract yet. I am leaving this PR in draft until that callback path is fixed.

@Thump604
Collaborator

This is the right approach. We had #194 open for the same crash (one-line unpack fix) but just closed it -- your PR is the only one of the four submissions that actually preserves checkpoint semantics through the chunked prefill lifecycle instead of silently discarding them.

The checkpoint-aware finalization step (feeding remaining checkpoint tokens through the model before generation) is what the others all miss.

Happy to test once the callback gap is resolved. We run chunked prefill on Qwen3.5-122B (M2 Ultra 128GB) with 2048-token prefill steps.

The upstream BatchGenerator contract requires prompt_checkpoint_callback
to fire after cache finalization, before the checkpoint tail model call.
The chunked-prefill monkeypatch preserved the checkpoint field but never
invoked the callback, breaking the upstream checkpoint contract.

Wire _lazy_extract_cache from mlx-lm and invoke the callback at the
correct semantic boundary. Add regression test verifying the callback
fires with the correct uid and checkpoint offset.
Collaborator

@Thump604 Thump604 left a comment


The PR correctly addresses the callback gap that was flagged. The fix preserves prompt_checkpoints through partial resume and invokes the callback after finalization, which the prior submissions missed. The semantic changes to checkpoint-aware bounds checking are sound.

However, I have three concrete concerns before this merges:

  1. Test coverage is incomplete. test_chunked_prefill_accepts_prompt_checkpoints constructs a 7-tuple but never exercises the actual zip(*batch_prompts) unpack on line 414. test_chunked_prefill_invokes_checkpoint_callback doesn't validate the "checkpoint tail" model execution (lines 313-318), which is the most complex part of the fix. A test with checkpoint > 1 initially (not just in resume) would strengthen confidence.

  2. Line 425 has a suspicious defensive formula: (ln - pc if pc > 0 else -pc). Checkpoint offsets should always be positive. Why the negation when pc <= 0? This deserves a comment explaining the intent — is this dead code, or does mlx-lm ever produce zero/negative checkpoints?

  3. The checkpoint tail replay (lines 313-318) feeds tokens 0..checkpoint-1 through the model after finalize(). Is this the intended semantics? The comment on line 303 says "once only the checkpoint tail remains" but then we immediately replay it. If the cache already saw these tokens during chunked processing, feeding them again seems redundant. I need clarification: does finalize() clear cache state, or just mark it finalized? If it clears, the replay is necessary; if not, it may be redundant computation.

The implementation is not blocking — all three can be addressed post-merge. But they should be addressed before production use on 122B at 2048-token prefill steps. Otherwise this is good work fixing a real checkpoint contract violation.
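A sketch of the extra regression test point 1 asks for, driving the 7-tuple unpack directly with an initial checkpoint greater than 1. `chunked_next` and the tuple fields here are stand-ins for the real scheduler internals, not the actual `_chunked_next`.

```python
# Hedged sketch of the requested test: exercise the 7-field
# zip(*batch_prompts) unpack and check the callback sees an initial
# checkpoint offset > 1 (not only one produced during resume).
# All names are illustrative stand-ins for the scheduler internals.

def chunked_next(batch_prompts, callback):
    # The unpack under test: seven fields, matching newer mlx-lm tuples.
    uids, tokens, *_rest, checkpoints = zip(*batch_prompts)
    for uid, cp in zip(uids, checkpoints):
        if cp and cp[0] > 1:      # initial checkpoint beyond token 1
            callback(uid, cp[0])


def test_unpack_and_initial_checkpoint():
    fired = []
    prompts = [("req-0", [1, 2, 3, 4], None, None, None, 16, [3])]
    chunked_next(prompts, lambda uid, off: fired.append((uid, off)))
    assert fired == [("req-0", 3)]
```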

